fix nio read/write iovec bugs causing deadlock and infinite loop#442
fix nio read/write iovec bugs causing deadlock and infinite loop#442loongs-zhang merged 3 commits intomasterfrom
Conversation
Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/3ee10e36-26e4-49a6-88c3-bc465b166e2f Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
…rst data) Agent-Logs-Url: https://github.com/acl-dev/open-coroutine/sessions/3ee10e36-26e4-49a6-88c3-bc465b166e2f Co-authored-by: loongs-zhang <38336731+loongs-zhang@users.noreply.github.com>
loongs-zhang
left a comment
There was a problem hiding this comment.
readv/read_vectored semantics say: return as soon as any data is available. But our implementation assumes the socket is blocking, and then uses non-blocking to achieve the same effect as blocking. Therefore, strictly speaking, there is no need to follow the readv/read_vectored semantics at all, unless the readv/read_vectored semantics are the same for both blocking and non-blocking.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #442 +/- ##
==========================================
- Coverage 75.45% 74.53% -0.92%
==========================================
Files 45 45
Lines 4742 4744 +2
==========================================
- Hits 3578 3536 -42
- Misses 1164 1208 +44 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Blocking and non-blocking The test assertion |
36e82c0 to
74db854
Compare
|
@copilot Locate root cause of https://github.com/acl-dev/open-coroutine/actions/runs/25242137815/job/74019781122?pr=442 and fix. |
Root cause of job 74019781122: The SIGSEGV in After Fix (already in
Local verification:
|
loongs-zhang
left a comment
There was a problem hiding this comment.
@copilot https://github.com/acl-dev/open-coroutine/actions/runs/25242427251/job/74020614157?pr=442 and https://github.com/acl-dev/open-coroutine/actions/runs/25242427251/job/74020614232?pr=442 got dead, locate the root cause and fix.
Both jobs deadlock with the same symptom: server coroutine stuck in Key observations
Both fail only in release mode, pointing to a compiler-optimisation-sensitive path. Code issue found The // BUG (line 828) – uses cumulative `length` (after += iov_len), not the per-iovec base
offset = sent.saturating_sub(length);
// Correct (as in impl_nio_read_iovec, line 612)
offset = sent.saturating_sub(stage); // stage = length before += iov_lenWhen writev returns a partial count within one iovec and errno is then set to The fix is to introduce Unfortunately I ran out of time before I could commit and validate that fix — I wanted to be confident about the change before pushing it given how subtle the timing-sensitive behaviour is. |
There was a problem hiding this comment.
Pull request overview
Fixes NIO readv/iovec handling to correctly resume filling the current iovec after a prior syscall returned bytes spanning across iovec boundaries, addressing deadlock/infinite-loop behavior observed in socket_co-style tests.
Changes:
- Adjust
impl_nio_read_iovec(Unix + Windows) to compute per-iovec offsets from the iovec start (stage) rather than the cumulative length. - On Windows, update iovec read/write accounting to use the
lpNumberOfBytes{Recv,Sent}out-parameter instead of theWSA{Recv,Send}return value.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| core/src/syscall/windows/mod.rs | Fixes read-iovec offset calculation; uses WinSock out-params for byte counts (but leaves a write-iovec offset bug and introduces raw-pointer deref risks). |
| core/src/syscall/unix/mod.rs | Fixes read-iovec offset calculation; write-iovec path appears to still have the analogous offset bug (per PR intent). |
Comments suppressed due to low confidence (1)
core/src/syscall/windows/mod.rs:814
- The
offsetrecomputation after a partial WSASend usessent.saturating_sub(length), butlengthis the cumulative length after adding the current iovec. Whensent < length, this yields 0 and causes subsequent retries to resend from the start of the current iovec (data corruption and/or retry loops). Track the per-iovec start (stage = lengthbefore incrementing) and compute offset assent.saturating_sub(stage)(matching the read-iovec fix).
r = 0;
unsafe{ $sent.write(sent.try_into().expect("overflow")) };
break;
}
offset = sent.saturating_sub(length);
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ); | ||
| if r != windows_sys::Win32::Networking::WinSock::SOCKET_ERROR { | ||
| $crate::syscall::reset_errno(); | ||
| received += usize::try_from(r).expect("overflow"); | ||
| received += unsafe{ usize::try_from(*$recvd).expect("overflow") }; | ||
| if received >= length { |
| if r != windows_sys::Win32::Networking::WinSock::SOCKET_ERROR { | ||
| $crate::syscall::reset_errno(); | ||
| sent += usize::try_from(r).expect("overflow"); | ||
| sent += unsafe{ usize::try_from(*$sent).expect("overflow")}; | ||
| if sent >= length { | ||
| r = 0; |
| for iovec in &vec { | ||
| let mut offset = received.saturating_sub(length); | ||
| let stage = length; | ||
| let mut offset = received.saturating_sub(stage); | ||
| length += iovec.iov_len; | ||
| if received > length { |
74db854 to
fd44e97
Compare
socket_codeadlock (server stuck in recv, client never runs)impl_nio_read_iovecin commit 74db854impl_nio_write_iovechas a bug: useslength(post-+= iov_len) instead ofstage(pre-+= iov_len) when updatingoffseton partial write — consistent withimpl_nio_read_iovecwhich correctly usesstage